Here we source the bit of code that performs this operation. essentially, this one-line command runs an entire R script file. This can be handy when you load the same libraries in different steps of your analysis
source("./code/install_packages_load_libraries.R")
Here we will load data from 3 species in the Brassicacea family experiment: Arabidopsis thaliana, Brassica oleraceae and Isatis tinctoria. This data has been heavily filtered to reduce complexity and size.
# load object
load("./data/ps_rarefied.Rdata")
#check ps object
ps_rarefied # full ps object
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1000 taxa and 143 samples ]
## sample_data() Sample Data: [ 143 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 1000 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 1000 reference sequences ]
otu_table(ps_rarefied)[1:5,1:5] # 5 first rows across 5 first columns
## OTU Table: [5 taxa and 5 samples]
## taxa are rows
## 49_16S 50_16S 51_16S 52_16S 53_16S
## bASV_5 90 299 109 87 231
## bASV_7 25 156 154 446 58
## bASV_9 157 191 264 108 145
## bASV_11 337 248 193 216 100
## bASV_12 201 437 272 71 154
tax_table(ps_rarefied)[1:5,] # 5 first rows across all columns
## Taxonomy Table: [5 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class
## bASV_5 "k__Bacteria" "p__Proteobacteria" "c__Gammaproteobacteria"
## bASV_7 "k__Bacteria" "p__Proteobacteria" "c__Gammaproteobacteria"
## bASV_9 "k__Bacteria" "p__Bacteroidota" "c__Bacteroidia"
## bASV_11 "k__Bacteria" "p__Proteobacteria" "c__Alphaproteobacteria"
## bASV_12 "k__Bacteria" "p__Actinobacteriota" "c__Actinobacteria"
## Order Family Genus
## bASV_5 "o__Burkholderiales" "f__Comamonadaceae" "g__Acidovorax"
## bASV_7 "o__Pseudomonadales" "f__Pseudomonadaceae" "g__Pseudomonas"
## bASV_9 "o__Chitinophagales" "f__Chitinophagaceae" "g__Niastella"
## bASV_11 "o__Sphingomonadales" "f__Sphingomonadaceae" "g__Sphingomonas"
## bASV_12 "o__Streptomycetales" "f__Streptomycetaceae" "g__Streptomyces"
## ASV_id
## bASV_5 "bASV_5"
## bASV_7 "bASV_7"
## bASV_9 "bASV_9"
## bASV_11 "bASV_11"
## bASV_12 "bASV_12"
ps_rarefied@sam_data [1:5,1:5]# metadata
## Target Harvest_ID Block_row_column Sp_speed Stress
## 49_16S 16S 49 1.2.4 M1 Control
## 50_16S 16S 50 2.3.1 M1 Control
## 51_16S 16S 51 3.2.9 M1 Control
## 52_16S 16S 52 4.1.6 M1 Control
## 53_16S 16S 53 5.6.14 M1 Control
For a simple start, let’s create individual heat trees from an individual phyloseq object
# make a copy of you rarefied ps object
ps_rarefied_filt<-ps_rarefied
#remove unecessary taxonomic info ("ASV_id") by updating the tax table with a subset of the tax table
tax_table(ps_rarefied_filt)<-tax_table(ps_rarefied_filt)[,1:6]
# let's remove the "r__"ranks from the taxonomy, they can be useful but will pollute our plot
tax_table(ps_rarefied_filt)[, colnames(tax_table(ps_rarefied_filt))] <- gsub(pattern = "[a-z]__", # regular expression
x = tax_table(ps_rarefied_filt)[, colnames(tax_table(ps_rarefied_filt))], # "df"
replacement = "") # replacement for pattern
# transform from phyloseq to taxmap object
taxmap_obj<-parse_phyloseq(ps_rarefied_filt)
taxmap_obj is a object from the Taxmap class, but can be navigated as a nested list. Explore taxmap_obj with “$” and tab-completion. can you find the representative sequences in the data layer?
# the taxmap object is a data container, just like phyloseq
taxmap_obj
# this object has much more than what you can see on one print, just like much other R output you generate.
taxmap_obj$data$tax_data # taxa tables...
taxmap_obj$taxa # taxonomy names, ranks and codes
Now we can create a heat tree, with colors indicating the number of sequences in each taxa.
The first tree is actually quite busy - depending on your goal, you might want to further filter the input ASVs to a smaller subset like all differentially abundant ASVs or features defined as core microbiome communities
# create a heat tree!
set.seed(1) # set a seed so the layout can be exactly the same every time
heat_tree(.input = taxmap_obj, # your taxmap object from the metacoder package
node_size = n_obs, # n_obs is a function that calculates the number of OTUs per taxon
node_color = n_obs, # this function from metacoder is called with non-standard evaluation (no arguments)
node_label = taxon_names, # labels for the taxons
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
output_file = "./results/heat_tree_test.pdf") # A PDF file you will export this plot to
Can you well which is the most diverse phylum? which are the 3 most diverse families?
check what’s produced with the n_obs argument/function used in the tree above and also check the taxa names
#n_obs is a function that takesyour taxmap object as input, and return the number of sequences in each taxa
n_obs(taxmap_obj)
#in metacoder, ab, ac, ad, etc refer each to a different taxonomy
taxmap_obj$taxa
It might be too dense/complex to look, so let’s remove taxa names that can be unfamiliar like “phylum NB1-J”
# create a heat cleaner tree!
taxmap_obj %>%
metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
layout = "davidson-harel",
initial_layout = "reingold-tilford",
output_file = "./results/heat_tree_test_cleaner.pdf")
There are many options to customize these trees, we will only explore a few of them
This tree stops at Order level
# create a heat cleaner tree!
taxmap_obj %>%
metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
metacoder::filter_taxa(taxon_ranks == "Order", supertaxa = TRUE) %>% # subset to the Order rank
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
layout = "davidson-harel",
initial_layout = "reingold-tilford",
output_file = "./results/heat_tree_order.pdf")
This tree has a separated root at different taxonomic levels, and custom colors
# create a heat cleaner tree!
taxmap_obj %>%
metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
metacoder::filter_taxa(taxon_names %in% c("Rhizobiales", "Actinobacteria", "Xanthomonadaceae"),
subtaxa = TRUE,
supertaxa = FALSE) %>%
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
node_color_range = c("blue", "yellow", "pink", "red"),
layout = "davidson-harel",
initial_layout = "reingold-tilford",
output_file = "./results/heat_tree_sep_roots.pdf")
lets now compare ASV abundances between two groups of samples, such as plant species in brassicaceae lineage I and II (Sp_Lineage_Walden).
“tax_abund” is added as a data layer of the taxmap object. it defines the number of sequences on every single taxon - all phyla, all classes, orders… this is different from counts of every single ASV! this data layer is needed on the next step.
#get abundance per taxon
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj, # the taxmap object
data = "otu_table", # layer of data table to use
cols = taxmap_obj$data$sample_data$sample_id) #sample names
#check the new data layer you created. note that "ab" refers to "k__Bacteria", and "aac" to "p__Proteobacteria".
taxmap_obj$data$tax_abund
## # A tibble: 457 x 144
## taxon_id `49_16S` `50_16S` `51_16S` `52_16S` `53_16S` 54_16~1 55_16~2 56_16~3
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ab 10000 10000 10000 10000 10000 10000 10000 10000
## 2 ac 7137 7430 6677 7995 7774 7459 7183 7574
## 3 ad 1188 1012 1351 739 769 951 1242 998
## 4 ae 739 1152 1129 295 535 938 1085 829
## 5 af 355 101 349 363 267 186 135 165
## 6 ag 191 82 189 309 286 208 134 147
## 7 ah 42 65 80 22 40 47 28 81
## 8 ai 23 24 43 38 40 21 13 27
## 9 aj 10 7 18 31 12 5 6 0
## 10 ak 16 18 21 13 25 19 27 17
## # ... with 447 more rows, 135 more variables: `57_16S` <dbl>, `58_16S` <dbl>,
## # `59_16S` <dbl>, `60_16S` <dbl>, `61_16S` <dbl>, `62_16S` <dbl>,
## # `63_16S` <dbl>, `64_16S` <dbl>, `65_16S` <dbl>, `66_16S` <dbl>,
## # `67_16S` <dbl>, `68_16S` <dbl>, `69_16S` <dbl>, `70_16S` <dbl>,
## # `71_16S` <dbl>, `72_16S` <dbl>, `73_16S` <dbl>, `74_16S` <dbl>,
## # `75_16S` <dbl>, `76_16S` <dbl>, `77_16S` <dbl>, `78_16S` <dbl>,
## # `79_16S` <dbl>, `80_16S` <dbl>, `81_16S` <dbl>, `82_16S` <dbl>, ...
#quickly compare those values to the OTU table to see the differnce:
taxmap_obj$data$otu_table
## # A tibble: 1,000 x 145
## taxon_id otu_id `49_16S` `50_16S` `51_16S` `52_16S` `53_16S` 54_16~1 55_16~2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 jq bASV_5 90 299 109 87 231 40 78
## 2 jr bASV_7 25 156 154 446 58 191 376
## 3 js bASV_9 157 191 264 108 145 128 245
## 4 jt bASV_11 337 248 193 216 100 499 403
## 5 ju bASV_12 201 437 272 71 154 240 362
## 6 jv bASV_13 226 178 89 162 115 255 230
## 7 jw bASV_14 374 159 301 230 321 167 229
## 8 fi bASV_15 157 49 72 135 26 199 92
## 9 jy bASV_16 62 373 173 96 144 255 201
## 10 fj bASV_18 317 160 161 143 191 165 182
## # ... with 990 more rows, 136 more variables: `56_16S` <dbl>, `57_16S` <dbl>,
## # `58_16S` <dbl>, `59_16S` <dbl>, `60_16S` <dbl>, `61_16S` <dbl>,
## # `62_16S` <dbl>, `63_16S` <dbl>, `64_16S` <dbl>, `65_16S` <dbl>,
## # `66_16S` <dbl>, `67_16S` <dbl>, `68_16S` <dbl>, `69_16S` <dbl>,
## # `70_16S` <dbl>, `71_16S` <dbl>, `72_16S` <dbl>, `73_16S` <dbl>,
## # `74_16S` <dbl>, `75_16S` <dbl>, `76_16S` <dbl>, `77_16S` <dbl>,
## # `78_16S` <dbl>, `79_16S` <dbl>, `80_16S` <dbl>, `81_16S` <dbl>, ...
here we define taxa “occurrence” by determining in how many samples of each group they appear
#get occurrence of per lineage
taxmap_obj$data$tax_occ_2lineages <- calc_n_samples(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data$Sp_Lineage_Walden) # What category each sample is
#check the object. it shows how many times a taxon is represented in lineage I or II
taxmap_obj$data$tax_occ_2lineages
## # A tibble: 457 x 3
## taxon_id lineage_II lineage_I
## <chr> <int> <int>
## 1 ab 95 48
## 2 ac 95 48
## 3 ad 95 48
## 4 ae 95 48
## 5 af 95 48
## 6 ag 95 48
## 7 ah 95 48
## 8 ai 95 48
## 9 aj 94 40
## 10 ak 95 48
## # ... with 447 more rows
Here we see log fold changes
# this will calculate log2 median ratios and p values for a wilcoxcon test within taxas in this lineage
taxmap_obj$data$diff_table_2lineages <- compare_groups(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data$Sp_Lineage_Walden)
#check the object.
taxmap_obj$data$diff_table_2lineages
## # A tibble: 457 x 7
## taxon_id treatment_1 treatment_2 log2_median_ratio media~1 mean_d~2 wilcox_~3
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ab lineage_II lineage_I 0 0 0 NaN
## 2 ac lineage_II lineage_I 0.126 624. 5.56e+2 1.34e-7
## 3 ad lineage_II lineage_I 0.0991 68 -4.57e+0 4.77e-1
## 4 ae lineage_II lineage_I -0.795 -394. -3.58e+2 5.75e-7
## 5 af lineage_II lineage_I 0.0316 5.5 -1.54e-2 9.76e-1
## 6 ag lineage_II lineage_I -0.448 -65.5 -6.67e+1 1.77e-5
## 7 ah lineage_II lineage_I -0.257 -11.5 -1.05e+1 1.93e-1
## 8 ai lineage_II lineage_I -0.665 -20.5 -2.11e+1 1.10e-6
## 9 aj lineage_II lineage_I -0.115 -1 -7.21e-1 7.78e-1
## 10 ak lineage_II lineage_I -0.445 -6.5 -6.32e+0 1.23e-3
## # ... with 447 more rows, and abbreviated variable names 1: median_diff,
## # 2: mean_diff, 3: wilcox_p_value
NOTE that the “func” argument of the compare_groups() function allows you to use a custom function to calculate the significant of differential abudance tests. I have not tried to change the standard values, but you might be able to implement zimbwave in here.
tan indicates lineage II, cyan indicates lineage I
taxmap_obj %>%
metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = log2_median_ratio,
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_color_range = c("cyan", "gray", "tan"), # The color palette used
layout = "davidson-harel",
node_color_axis_label = "Log2 ratio, cyan for Lineage I",
initial_layout = "reingold-tilford",
output_file = "./results/heat_tree_diff_lineages.pdf")
This plot indicates log fold changes, but includes p values > 0.05. let’s run any non-significant differences into a median log ratio of zero, shading them as grey in the plot
# set differential log ratio to 0 based on p values below 0.05
taxmap_obj$data$diff_table_2lineages$log2_median_ratio[taxmap_obj$data$diff_table_2lineages$wilcox_p_value > 0.05] <- 0
# plot the heatmap
taxmap_obj %>%
metacoder::filter_taxa(grepl(pattern = "^[a-zA-Z]+$", taxon_names)) %>% # remove "odd" taxa
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = log2_median_ratio,
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_color_range = c("cyan", "gray", "tan"), # The color palette used
layout = "davidson-harel",
node_color_axis_label = "Log2 ratio, cyan for Lineage I",
initial_layout = "reingold-tilford",
output_file = "./results/heat_tree_diff_lineages_pcut.pdf")
From this figure, we could see a larger presence of low-diversity phyla in lineage I (cyan). lineage II has a clear increase in Sphingomonadaceae. both lineages seem to be colonized by several, but different, bacteroidia and actinobacteria. Family Comamonadaceae and Chitinophagaceae seem to be the most diverse
If you have more than 2 groups, like we have 3 different stresses, you can plot a matrix of heat trees with a dedicated function from the same package. Before we do that, let’s first calculate taxon abundance, prevalence, and log-fold changes like we did in chunks 3.5.1 to 3.5.3
#abundance per taxon was already calculated in chunk 3.5.1, now we just overwrite
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj,
data = "otu_table",
cols = taxmap_obj$data$sample_data$sample_id)
#get occurrence of ASVs per treatment
taxmap_obj$data$tax_occ_3treatments <- calc_n_samples(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data$Stress) # now refer to Stress, that has 3 groups
# calculate log2 median ratios and p values for a wilcoxcon test within taxas in this stress treatment groups
taxmap_obj$data$diff_table_3treatments <- compare_groups(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data$Stress)
# set differential log ratio to 0 based on adjusted p values
taxmap_obj$data$diff_table_3treatments$log2_median_ratio[taxmap_obj$data$diff_table_3treatments$wilcox_p_value > 0.05] <- 0
# because of potential ambiguity in non-standard evaluation, we should change this column name. this is only needed because of the column taxmap_obj$data$diff_table_2lineages$log2_median_ratio we created in chunk 3.5.3
names(taxmap_obj$data$diff_table_3treatments)[4]<-"log2_median_ratio_stress"
Now that we have log-fold differences in pairwise treatment comparisons, let’s compare them in a matrix of heat trees
#plot matrix tree
set.seed(1)
heat_tree_matrix(taxmap_obj,
data = "diff_table_3treatments", # this is the table with the data you want to plot
node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
node_label = taxon_names,
node_color = log2_median_ratio_stress, # A column from `taxmap_obj$data$diff_table_3treatments`
node_color_range = diverging_palette(), # The built-in palette for diverging data
node_color_trans = "linear", # The default is scaled by circle area
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 ratio median proportions",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
output_file = "./results/matrix_heat_tree_Stress.pdf") # Saves the plot as a pdf file
by changing the code above, Make a matrix of heat trees that plots the differences between the 3 plant species we are using. The metadata for that is present in the metadata column “sp_full_name”. Which alphaproteobacteria are more common in Brassica oleracea than Isatis tinctoria?
These trees are interesting, but as you start exploring real data and multiple options you will need a more efficiency way of running all this code. For this, we create a custom function that handles all of it at once,
This custom function will:
1 - truncate the taxonomy table to genus level,
2 - remove the g__ and f__ etc from taxonomy table 3 - parse the
phyloseq object into a taxmap object, 4 - calculate taxa occurrence,
abundance and log-fold differences between groups of samples that you
define 5 - plot a heat tree
It’s arguments are 1 - ps_object = a phyloseq object 2 - sample_group = a (quoted) column from the sample metadata that you want to compare
The function will count the number of different groups in your metadata column to decide if it will plot a heat tree (2 treatments/groups) or matrix of heat trees (3+ treatments). the function should fail if your metadata only has 1 sample group!
phyloseq_to_heat_tree_matrix<-function(ps_object, sample_group){
# this function output is a heat trees comparing metadata. the input is based on a phyloseq object
# ps object = a phyloseq object, containing sample metadata, otu table, and taxonomy table
# sample_group = the name of a column in your emtadata that you want to compare in the heat tree. it has to be quoted.
# this function will create a matrix of heat trees if your metadata has more than 2 groups. it should fail if it has only 1 group
#remove unecessary taxonomic info (dada2id, "s__" and "ASV_id) by updating the tax table with a subset of the tax table
tax_table(ps_object)<-tax_table(ps_object)[,1:6]
# let's remove the "r__"ranks from the taxonomy, they can be useful but will pollute our plot
tax_table(ps_object)[, colnames(tax_table(ps_object))] <- gsub(pattern = "[a-z]__", # regular expression pattern to search for
x = tax_table(ps_object)[, colnames(tax_table(ps_object))], # "df"
replacement = "") # replacement for pattern
# transform from phyloseq to taxmap object
taxmap_obj<-parse_phyloseq(ps_object)
#get abundance per taxon
taxmap_obj$data$tax_abund<-calc_taxon_abund(obj = taxmap_obj,
data = "otu_table",
cols = taxmap_obj$data$sample_data$sample_id)
#get occurrence of ASVs per treatment
# the sample groups needs some wrangling to fit within the soft code of the function
taxmap_obj$data$tax_occ<- calc_n_samples(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]])
# calculate log2 median ratios and p values for a wilcoxcon test within taxas in this stress treatment groups
# the sample groups needs some wrangling to fit within the soft code of the function
taxmap_obj$data$diff_table <- compare_groups(obj = taxmap_obj,
data = "tax_abund",
cols = taxmap_obj$data$sample_data$sample_id,
groups = taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]])
# set differential log ratio to 0 based on adjusted p values
taxmap_obj$data$diff_table$log2_median_ratio[taxmap_obj$data$diff_table$wilcox_p_value > 0.05] <- 0
# define number of compared factors
factors_compared<-taxmap_obj$data$sample_data[colnames(taxmap_obj$data$sample_data)==sample_group][[1]]
# draw the plot based on an if else statement: if there are 2 groups, plot a a heat tree comparing abundances between both groups, else plot a matrix of ehat trees. this function will fail if you only have 1 sample group!
if (length(unique(factors_compared)) == 2) {
set.seed(1)
output<- taxmap_obj %>%
heat_tree(
node_label = taxon_names,
# data = "diff_table",
node_size = n_obs,
node_color = log2_median_ratio,
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_color_range = c("cyan", "gray", "tan"), # The color palette used
layout = "davidson-harel",
initial_layout = "reingold-tilford")
} else {
set.seed(1)
output<-heat_tree_matrix(taxmap_obj,
data = "diff_table", # this is the table with the data you want to plot
node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
node_label = taxon_names,
node_color = log2_median_ratio, # A column from `taxmap_obj$data$diff_table_3treatments`
node_color_range = diverging_palette(), # The built-in palette for diverging data
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 ratio median proportions",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
}
# clearly define the output object you will get from the function
return(output)
}
Now that we defined a new function that takes a phyloseq object as a data input and a column of the metadata as an argument, we can more easily produce heat trees across the variables.
# run custom function
heat_tree_stress<-phyloseq_to_heat_tree_matrix(ps_object = ps_rarefied, sample_group = "Stress")
heat_tree_lineage<-phyloseq_to_heat_tree_matrix(ps_object = ps_rarefied, sample_group = "Sp_Lineage_Walden")
# check stress tree
heat_tree_stress
# check lineage tree
heat_tree_lineage
Now that you know the basics, let’s manipulate some of these visualizations
Now produce new heat tree, based on metadata like “Sp_full_name”, “greenhouse_compartment” or “Speed”!
Our trees so far stopped at genus level, because ploting 2000 ASVs would overcrowd the plot. Show the “Sp_full_name” effect at ASV level, but only on Burkholderiales and Rhizobiales. What do you have to change, and where? You can copy-paste and adjust the code we saw above, or make a new version of the custom function
Our trees so far have k__Bacteria as a shared root. What will happen if you remove that taxonomic information? to make this change, What do you have to alter? You can copy-paste and adjust the code we saw above, or make a new version of the custom function
Now that we have a function that takes a single phyloseq object as input, let’s run it across all the species in the dataset
First, let’s split the phyloseq object into a list of phyloseq objects according the species
note that the phyloseq input is based on the original phyloseq object, that still contains the ranks as p__Proteobacteria, o__Rhizobiales, etc
# turn a ps object into a list f ps objects
ps_l<- metagMisc::phyloseq_sep_variable(ps_rarefied, variable = "Sp_abb_name")
Let’s check the list we just created. there are different ways we can do this
# check object, which is a list of phyloseq objects (ps_l)
ps_l
## $At
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 963 taxa and 48 samples ]
## sample_data() Sample Data: [ 48 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 963 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 963 reference sequences ]
##
## $Bo_M
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 994 taxa and 48 samples ]
## sample_data() Sample Data: [ 48 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 994 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 994 reference sequences ]
##
## $It
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 988 taxa and 47 samples ]
## sample_data() Sample Data: [ 47 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 988 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 988 reference sequences ]
# access the Arabidosps phyloseq object
ps_l$At
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 963 taxa and 48 samples ]
## sample_data() Sample Data: [ 48 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 963 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 963 reference sequences ]
# access the taxonomy table of the Arabidosps phyloseq object
ps_l$At@tax_table[1:5,]
## Taxonomy Table: [5 taxa by 7 taxonomic ranks]:
## Kingdom Phylum Class
## bASV_5 "k__Bacteria" "p__Proteobacteria" "c__Gammaproteobacteria"
## bASV_7 "k__Bacteria" "p__Proteobacteria" "c__Gammaproteobacteria"
## bASV_9 "k__Bacteria" "p__Bacteroidota" "c__Bacteroidia"
## bASV_11 "k__Bacteria" "p__Proteobacteria" "c__Alphaproteobacteria"
## bASV_12 "k__Bacteria" "p__Actinobacteriota" "c__Actinobacteria"
## Order Family Genus
## bASV_5 "o__Burkholderiales" "f__Comamonadaceae" "g__Acidovorax"
## bASV_7 "o__Pseudomonadales" "f__Pseudomonadaceae" "g__Pseudomonas"
## bASV_9 "o__Chitinophagales" "f__Chitinophagaceae" "g__Niastella"
## bASV_11 "o__Sphingomonadales" "f__Sphingomonadaceae" "g__Sphingomonas"
## bASV_12 "o__Streptomycetales" "f__Streptomycetaceae" "g__Streptomyces"
## ASV_id
## bASV_5 "bASV_5"
## bASV_7 "bASV_7"
## bASV_9 "bASV_9"
## bASV_11 "bASV_11"
## bASV_12 "bASV_12"
# check metadata of the second list object
sample_data(object = ps_l[[2]])[1:5,1:5]
## Target Harvest_ID Block_row_column Sp_speed Stress
## 49_16S 16S 49 1.2.4 M1 Control
## 50_16S 16S 50 2.3.1 M1 Control
## 51_16S 16S 51 3.2.9 M1 Control
## 52_16S 16S 52 4.1.6 M1 Control
## 53_16S 16S 53 5.6.14 M1 Control
lapply (list apply) will apply a function over a list. as we have a list of phyloseq objects now, it will be easy to produce several matrices of heat trees. when working on lists of 33 species we learned that 33 plots are hardly helpful. still, the code and the concept (make a custom function that uses a phyloseq object as input, then run over a list with lapply) can be very useful on your own analysis
# are you new to lapply? see here how it works. you first provide a list ("ps_l"), and then a function ("ntaxa") you will run independently on each element x of your list(ps_l$At, ps_l$Bo_M, ps_l$It). here we simply count the number of taxa of each of the 3 phyloseq objects in the list
lapply(ps_l, function(x)
ntaxa(x))
## $At
## [1] 963
##
## $Bo_M
## [1] 994
##
## $It
## [1] 988
# run the custom function over a list of phyloseq objects... it can take a moment
heat_tree_stress_l<-lapply(ps_l, function(x)
phyloseq_to_heat_tree_matrix (ps_object = x, sample_group = "Stress"))
What if we want to visualize other types of taxa-associated data, like p values from deseq2, importance in random forest predictions, distances in an ordination, or degree in a network? You can do that, as long as the vector of values you use as node_size or node_colour have the same length as the number of taxon names you are plotting. You can for example add another data layer to the taxmap object, or provide data from an independent df as the argument of node_size or node_colour.
To explore this, lets first check some of the taxmap object structure, then create some random data and add it to the heat tree
# let's review one of the more basic trees.
taxmap_obj %>%
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
layout = "davidson-harel",
initial_layout = "reingold-tilford")
#let's review what is actually used as input for node size and color
n_obs(taxmap_obj)
## ab ac ad ae af ag ah ai aj ak al am an ap aq ar
## 1000 574 102 84 79 59 15 11 3 3 18 21 8 14 4 1
## as at au av aw ax ay az ba bb bc bd be bf bg bh
## 1 1 359 102 215 64 23 57 13 11 24 3 3 18 18 8
## bj bk bl bm bn bo bp bq br bs bt bu bv bw bx by
## 8 4 2 4 2 5 2 14 24 3 1 1 3 4 1 1
## bz ca cb cc cd ce cf cg ch ci cj ck cl cm cn co
## 1 272 8 53 77 9 82 32 10 62 15 8 57 11 21 1
## cp cq cr cs ct cu cv cw cx cy cz da db dc de df
## 5 17 17 1 6 1 9 3 3 6 8 5 18 8 8 14
## dg dh di dj dk dl dm dn do dp dq dr ds dt du dv
## 2 4 3 2 4 8 2 2 4 2 5 22 8 5 5 2
## dw dx dy dz ea eb ec ed ee ef eg eh ei ej ek el
## 7 1 3 9 6 3 3 2 1 1 1 2 5 1 1 2
## em en eo ep eq er es et eu ew ex ey ez fa fc fd
## 3 1 2 3 1 2 1 1 1 1 1 1 1 1 100 8
## fe ff fg fh fi fj fk fl fm fn fo fp fq fr fs ft
## 52 77 9 25 111 29 31 4 3 44 7 9 18 4 6 57
## fu fv fw fx fy fz ga gb gc gd ge gf gg gh gi gj
## 11 21 1 5 15 4 17 1 6 1 9 3 3 6 8 14
## gk gl gm gn gp gq gr gs gt gu gv gw gx gy gz ha
## 5 16 4 7 4 8 14 19 2 1 3 2 4 4 4 5
## hb hc hd he hf hg hh hi hj hk hl hm hn ho hp hq
## 8 2 10 3 2 2 2 2 2 8 4 2 8 3 5 2
## hr hs ht hu hv hw hx hy hz ia ib ic id ie if ig
## 14 1 1 1 3 2 9 6 3 3 1 1 3 2 1 1
## ih ii ij ik il im in io ip iq ir it iu iv iw ix
## 1 2 5 1 1 3 4 2 2 3 3 3 1 2 3 1
## iy iz ja jb jc jd je jf jg ji jj jk jl jm jn jp
## 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1
## jq jr js jt ju jv jw jy ka kb kc kh kj kk km kn
## 6 8 30 36 8 11 13 67 5 1 1 4 3 2 5 4
## ko kp kq kr ks kt ku kv kw kx ky kz la lb lc ld
## 4 10 9 2 3 4 22 3 4 6 7 1 2 9 1 8
## le lf lg lh li lj lk ll lm ln lo lp lq lr ls lt
## 4 24 27 2 2 1 1 2 5 12 3 5 3 4 9 1
## lu lw lx ly lz ma mb mc me mf mg mh mi mj mk ml
## 1 1 9 3 3 3 3 4 6 6 10 1 2 5 16 4
## mm mn mo mp mq ms mt mv mw my mz na nb nc nd ne
## 2 3 2 6 2 2 8 19 3 1 1 9 3 2 7 2
## nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu
## 3 3 3 5 4 4 2 3 8 1 3 1 4 9 3 2
## nv nw nx ny nz oa ob oc od oe of og oh oi oj ok
## 2 2 10 3 1 2 2 1 4 12 10 1 1 2 2 1
## ol om on oo op oq or os ot ou ov ow ox oy oz pa
## 6 1 2 4 1 1 7 3 2 5 2 14 2 3 1 1
## pb pc pd pe pf pg ph pi pj pk pl pm pn po pp pq
## 1 1 1 3 1 2 9 4 1 1 6 3 1 3 1 1
## pr ps pt pv pw py pz qb qc qd qe qf qg qh qj qk
## 1 1 3 5 2 1 1 1 1 2 3 1 1 3 2 3
## ql qm qn qo qp qq qr qs qt qu qv qx qy qz ra rb
## 1 1 2 2 2 2 4 2 3 1 1 1 1 1 2 5
## rc re rf rg rh ri rj rk rl rn ro rp rq rr rs rt
## 2 3 1 1 1 1 1 1 2 1 2 1 1 1 1 1
## ru rv rw rx ry rz sa sb sc sd se sf sg sh sj sk
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## sl sm sn so sp sq sr st su
## 1 1 1 1 1 1 1 1 1
# ab, es, ff etc all refer to specific nodes int he tree taxonomic groups. the number referts to how many sequences belong to that taxonomy (all 1000 sequences, are part of Bacteria "ab", 574 are Proteobacteria "ac")
taxmap_obj$taxa[1:10]
## $ab
## <Taxon>
## name: Bacteria
## rank: Kingdom
## id: none
## authority: none
##
## $ac
## <Taxon>
## name: Proteobacteria
## rank: Phylum
## id: none
## authority: none
##
## $ad
## <Taxon>
## name: Bacteroidota
## rank: Phylum
## id: none
## authority: none
##
## $ae
## <Taxon>
## name: Actinobacteriota
## rank: Phylum
## id: none
## authority: none
##
## $af
## <Taxon>
## name: Acidobacteriota
## rank: Phylum
## id: none
## authority: none
##
## $ag
## <Taxon>
## name: Gemmatimonadota
## rank: Phylum
## id: none
## authority: none
##
## $ah
## <Taxon>
## name: Myxococcota
## rank: Phylum
## id: none
## authority: none
##
## $ai
## <Taxon>
## name: Firmicutes
## rank: Phylum
## id: none
## authority: none
##
## $aj
## <Taxon>
## name: Methylomirabilota
## rank: Phylum
## id: none
## authority: none
##
## $ak
## <Taxon>
## name: Armatimonadota
## rank: Phylum
## id: none
## authority: none
So if you have a vector from the same length of n_obs(taxmap_obj) with data from other analysis, you can visualize other outputs in a heat tree! lets prepare a df that will add new, random data
let’s extract the taxon ids from the taxmap object
#extract the names of the taxonomies in the heat tree, accoding the ab ac etc order
taxon_id_metacoder<-lapply(taxmap_obj$taxa, function (x)
x$get_name())%>%
map(1)
# now turn that list into a df
taxon_id_metacoder<-do.call(rbind.data.frame, map(taxon_id_metacoder,1))
# and change column name
colnames(taxon_id_metacoder)<-"taxa_id"
#check output
head(taxon_id_metacoder)
## taxa_id
## 1 Bacteria
## 2 Proteobacteria
## 3 Bacteroidota
## 4 Actinobacteriota
## 5 Acidobacteriota
## 6 Gemmatimonadota
then create new (random) data, and add it to the taxon information you extracted
#make up some random data called new_data1
set.seed(1)
new_data1<-c(runif(n=100, min=1, max=10),
runif(n=200, min=5, max=15),
runif(n=100, min=10, max=20),
runif(n=57, min=20, max=25))
#make up some random data called new_data2
set.seed(1)
new_data2<-c(runif(n=57, min=1000, max=2000),
runif(n=200, min=600, max=1000),
runif(n=200, min=200, max=500))
# add new (random example) data into the df
taxon_id_metacoder$new_analysis_output1<-new_data1
taxon_id_metacoder$new_analysis_output2<-new_data2
#this is the df we created using data external from metacoder
taxon_id_metacoder[1:10,]
## taxa_id new_analysis_output1 new_analysis_output2
## 1 Bacteria 3.389578 1265.509
## 2 Proteobacteria 4.349115 1372.124
## 3 Bacteroidota 6.155680 1572.853
## 4 Actinobacteriota 9.173870 1908.208
## 5 Acidobacteriota 2.815137 1201.682
## 6 Gemmatimonadota 9.085507 1898.390
## 7 Myxococcota 9.502077 1944.675
## 8 Firmicutes 6.947180 1660.798
## 9 Methylomirabilota 6.662026 1629.114
## 10 Armatimonadota 1.556076 1061.786
the size of the node will relate to the first column of the new dataframe. the color of the node will related the the second column of the new dataframe
in a real case, you would have your valuable data, taxon by taxon, in a dataframe with similar structure
# add the new data into the heat tree!
taxmap_obj %>%
heat_tree(node_label = taxon_names,
node_size = taxon_id_metacoder$new_analysis_output1,
node_color = taxon_id_metacoder$new_analysis_output2,
node_size_axis_label = "Size: new data 2",
node_color_axis_label = "Color: new data 1",
layout = "davidson-harel",
initial_layout = "reingold-tilford")
Of course, there are countless ways you can integrate different pieces of data together - such as by matching the taxonomies ab, ac, ad etc; the names of the taxa like “f__Burkholderiales” etc. Note that you won’t find sample code for this (adding external data) in the metacoder package documentation.
In the MeJA applications pilot were we tested different MeJA concentrations, we run different analysis that highlight taxa as “important”: predictors for stress in random forest, keystone nodes in networks, and differentially abundant ASVs in treatment-control comparisons. to select which of the taxonomies of these “important” taxa to focus on, we performed a fisher test to check if the proportions of a taxonomy are similar between the important ASVs and the ASVs that were not defined as important.
The paper with further details should be submitted for publication this by Nov/Dez 2022 ~ you can cite that paper in the near future to use this code
These ASVs listed in “imp_asv_list.Rdata” were highlighted as differential abundant due to stress treatments OR important in random forest predictions for stress treatments OR network keystone/connector/hub of the species in a more complete dataset (with 70k ASVs). I call these the “important”ASVs.
After we load these ASVs names, we create a list of new ps objects by running phyloseq::prune_taxa (a function that selects taxa in a ps object according a character vector) with base::mapply (a function that runs a function on two lists simultaneously)
# load a list of ASV names.
load( file="./data/imp_asv_list.Rdata")
imp_asv_list # as you see, this is just a list of taxa
## $At
## [1] "bASV_9" "bASV_29" "bASV_71" "bASV_97" "bASV_108"
## [6] "bASV_155" "bASV_169" "bASV_175" "bASV_314" "bASV_391"
## [11] "bASV_443" "bASV_674" "bASV_776" "bASV_1058" "bASV_1085"
## [16] "bASV_1526" "bASV_1722" "bASV_2147" "bASV_4416" "bASV_1008"
## [21] "bASV_1035" "bASV_105" "bASV_1054" "bASV_106" "bASV_111"
## [26] "bASV_115" "bASV_117" "bASV_119" "bASV_12" "bASV_120"
## [31] "bASV_1244" "bASV_13" "bASV_132" "bASV_133" "bASV_134"
## [36] "bASV_138" "bASV_1400" "bASV_1421" "bASV_143" "bASV_154"
## [41] "bASV_158" "bASV_162" "bASV_165" "bASV_166" "bASV_190"
## [46] "bASV_192" "bASV_200" "bASV_202" "bASV_205" "bASV_21"
## [51] "bASV_218" "bASV_224" "bASV_226" "bASV_233" "bASV_234"
## [56] "bASV_254" "bASV_274" "bASV_279" "bASV_280" "bASV_289"
## [61] "bASV_292" "bASV_30" "bASV_31" "bASV_319" "bASV_32"
## [66] "bASV_333" "bASV_337" "bASV_339" "bASV_347" "bASV_36"
## [71] "bASV_369" "bASV_37" "bASV_38" "bASV_39" "bASV_395"
## [76] "bASV_40" "bASV_406" "bASV_424" "bASV_4245" "bASV_428"
## [81] "bASV_439" "bASV_442" "bASV_458" "bASV_46" "bASV_48"
## [86] "bASV_493" "bASV_50" "bASV_514" "bASV_515" "bASV_517"
## [91] "bASV_530" "bASV_531" "bASV_557" "bASV_573" "bASV_58"
## [96] "bASV_59" "bASV_599" "bASV_60" "bASV_605" "bASV_61"
## [101] "bASV_622" "bASV_63" "bASV_638" "bASV_64" "bASV_65"
## [106] "bASV_66" "bASV_698" "bASV_7" "bASV_709" "bASV_75"
## [111] "bASV_76" "bASV_762" "bASV_81" "bASV_83" "bASV_878"
## [116] "bASV_892" "bASV_90" "bASV_964" "bASV_98" "bASV_99"
## [121] "bASV_153" "bASV_845" "bASV_1590" "bASV_2049" "bASV_889"
## [126] "bASV_6729" "bASV_10033" "bASV_11577" "bASV_12011" "bASV_22991"
## [131] "bASV_24722" "bASV_27301" "bASV_35730" "bASV_38129" "bASV_67851"
## [136] "bASV_28909" "bASV_7928"
##
## $Bo_M
## [1] "bASV_18" "bASV_29" "bASV_53" "bASV_66" "bASV_67" "bASV_85"
## [7] "bASV_143" "bASV_226" "bASV_286" "bASV_291" "bASV_339" "bASV_417"
## [13] "bASV_418" "bASV_550" "bASV_814" "bASV_999" "bASV_1018" "bASV_1238"
## [19] "bASV_1305" "bASV_1464" "bASV_2760" "bASV_2999" "bASV_3347" "bASV_3617"
## [25] "bASV_102" "bASV_104" "bASV_106" "bASV_110" "bASV_112" "bASV_115"
## [31] "bASV_118" "bASV_12" "bASV_123" "bASV_1245" "bASV_13" "bASV_132"
## [37] "bASV_134" "bASV_1346" "bASV_137" "bASV_139" "bASV_145" "bASV_146"
## [43] "bASV_15" "bASV_164" "bASV_166" "bASV_170" "bASV_174" "bASV_181"
## [49] "bASV_195" "bASV_196" "bASV_203" "bASV_209" "bASV_214" "bASV_223"
## [55] "bASV_230" "bASV_244" "bASV_251" "bASV_255" "bASV_261" "bASV_266"
## [61] "bASV_282" "bASV_309" "bASV_310" "bASV_321" "bASV_326" "bASV_345"
## [67] "bASV_349" "bASV_355" "bASV_358" "bASV_376" "bASV_404" "bASV_439"
## [73] "bASV_45" "bASV_474" "bASV_48" "bASV_481" "bASV_542" "bASV_55"
## [79] "bASV_58" "bASV_590" "bASV_591" "bASV_613" "bASV_62" "bASV_64"
## [85] "bASV_73" "bASV_731" "bASV_754" "bASV_763" "bASV_780" "bASV_793"
## [91] "bASV_839" "bASV_91" "bASV_97" "bASV_98" "bASV_136" "bASV_320"
## [97] "bASV_1185" "bASV_278" "bASV_594" "bASV_1112" "bASV_316" "bASV_664"
## [103] "bASV_775"
##
## $It
## [1] "bASV_11" "bASV_46" "bASV_54" "bASV_55" "bASV_58"
## [6] "bASV_85" "bASV_102" "bASV_109" "bASV_129" "bASV_190"
## [11] "bASV_216" "bASV_223" "bASV_259" "bASV_314" "bASV_330"
## [16] "bASV_400" "bASV_520" "bASV_662" "bASV_716" "bASV_729"
## [21] "bASV_766" "bASV_776" "bASV_838" "bASV_1134" "bASV_1232"
## [26] "bASV_1469" "bASV_1846" "bASV_2793" "bASV_1019" "bASV_1039"
## [31] "bASV_111" "bASV_116" "bASV_12" "bASV_1228" "bASV_124"
## [36] "bASV_1242" "bASV_142" "bASV_149" "bASV_15" "bASV_1532"
## [41] "bASV_154" "bASV_160" "bASV_161" "bASV_162" "bASV_175"
## [46] "bASV_180" "bASV_182" "bASV_1867" "bASV_187" "bASV_1959"
## [51] "bASV_20" "bASV_207" "bASV_21" "bASV_215" "bASV_217"
## [56] "bASV_219" "bASV_230" "bASV_246" "bASV_249" "bASV_25"
## [61] "bASV_261" "bASV_270" "bASV_274" "bASV_277" "bASV_280"
## [66] "bASV_312" "bASV_32" "bASV_323" "bASV_350" "bASV_362"
## [71] "bASV_369" "bASV_372" "bASV_384" "bASV_392" "bASV_40"
## [76] "bASV_407" "bASV_425" "bASV_438" "bASV_44" "bASV_443"
## [81] "bASV_45" "bASV_451" "bASV_476" "bASV_477" "bASV_496"
## [86] "bASV_504" "bASV_508" "bASV_52" "bASV_569" "bASV_589"
## [91] "bASV_590" "bASV_593" "bASV_601" "bASV_602" "bASV_62"
## [96] "bASV_682" "bASV_69" "bASV_691" "bASV_705" "bASV_725"
## [101] "bASV_76" "bASV_768" "bASV_812" "bASV_840" "bASV_935"
## [106] "bASV_946" "bASV_210" "bASV_1006" "bASV_515" "bASV_2687"
## [111] "bASV_473" "bASV_5751" "bASV_4308" "bASV_8032" "bASV_30283"
## [116] "bASV_59443" "bASV_604" "bASV_1093" "bASV_394" "bASV_1002"
# filter your list of phyloliseq objects to only contain the ASVs defined as "importat"
imp_ASV_ps_l<-mapply (function (list_1,list_2) #mapply will let you manipulate 2 or more lists at once
prune_taxa(taxa = list_1, x = list_2), #here, x as the name of an argument of prune_taxa, and refers to a phyloseq object
list_1 = imp_asv_list, # here you define what R object is list_1
list_2 = ps_l, # here you define what R object is list_2
SIMPLIFY = FALSE)
# this ps object only has "important" ASVs
imp_ASV_ps_l
## $At
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 114 taxa and 48 samples ]
## sample_data() Sample Data: [ 48 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 114 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 114 reference sequences ]
##
## $Bo_M
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 92 taxa and 48 samples ]
## sample_data() Sample Data: [ 48 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 92 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 92 reference sequences ]
##
## $It
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 102 taxa and 47 samples ]
## sample_data() Sample Data: [ 47 samples by 44 sample variables ]
## tax_table() Taxonomy Table: [ 102 taxa by 7 taxonomic ranks ]
## refseq() DNAStringSet: [ 102 reference sequences ]
NOTE that the input is just a list of ASV names. you could use this function to summarize any other analysis method that tags several ASVs as relevant in your analysis
If we have the time, we explore details for these functions. their description and comments should be fairly sufficiency
the fisher_all_taxa_groups function requires as input: 1) a phyloseq object that contains your “important” taxa 2) a phyloseq objects that contains all your taxa
the fisher_to_heatTree function requires as input 1) the output of the fisher_all_taxa_groups function 2) a phyloseq object that contains your “important” taxa
is a specific taxonomy overly frequent in the “important” dataset when compared to the taxonomies of the ASV that were NOT defined as important?
note that this phyloseq input is based on the original phyloseq object, that still contains the ranks as p__Proteobacteria, o__Rhizobiales, etc
# how many rhizobiales do we have in full set of 963 ASVs of A. thaliana?
ntaxa(physeq = subset_taxa(ps_l$At, Order == "o__Rhizobiales"))
## [1] 81
# how many rhizobiales do we have in the group of 114 "important" ASVs for A. thaliana?
ntaxa(physeq = subset_taxa(imp_ASV_ps_l$At, Order == "o__Rhizobiales"))
## [1] 18
# how many rhizobiales do we have in the set of 849 A. thaliana ASVs that were not tagged as important ?
unimp_ASV_ps_at<-prune_taxa(!(taxa_names(ps_l$At) %in% imp_asv_list$At), ps_l$At)
ntaxa(physeq = subset_taxa(unimp_ASV_ps_at, Order == "o__Rhizobiales"))
## [1] 63
is 18 in 114 (~15%) similar to 63 in 849 (~7%), considering we have in total 18+63 (81) rhizobiales in 114+849 (983) bASVs? if these proportions are statistically different, rhizobiales may be over represented in the important ASV set.
What are we actually testing here? if the important ASVs are just a random subset of the complete dataset, the proportion of rhizobiales in both datasets should be similar
In other words: the chance that these many rhizobiales were tagged as “important” just because there is a lot of rhizobiales in the full/background dataset is too small.
by running the functions defined in chunk #6.2, we test the proportions stated in chunk # 6.3 - on all taxonomic levels of your input phyloseq object
# run a fisher test comparin these proportions at all taxonomic levels
fisher_output_at<-fisher_all_taxa_groups(ps_important_taxa = imp_ASV_ps_l$At,
ps_all_taxa = ps_l$At)
#check raw ouput of fisher tests for rhizobiales
fisher_output_at[1:3]
## $p__Acidobacteriota
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.8474
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3153814 Inf
## sample estimates:
## odds ratio
## 0.7106417
##
##
## $p__Actinobacteriota
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.9985
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.06810393 Inf
## sample estimates:
## odds ratio
## 0.2590167
##
##
## $p__Bacteroidota
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.09193
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.9037439 Inf
## sample estimates:
## odds ratio
## 1.557417
fisher_output_at$o__Rhizobiales
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_important_n, all_taxa_in_all_n - all_taxa_in_important_n - target_in_all_n), ncol = 2)
## p-value = 0.004991
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 1.345852 Inf
## sample estimates:
## odds ratio
## 2.283077
# with map() we can easely extract key values from this output, like so:
str(fisher_output_at$o__Rhizobiales) # here we see that p values are the first list element of each taxa ouput
## List of 7
## $ p.value : num 0.00499
## $ conf.int : num [1:2] 1.35 Inf
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num 2.28
## ..- attr(*, "names")= chr "odds ratio"
## $ null.value : Named num 1
## ..- attr(*, "names")= chr "odds ratio"
## $ alternative: chr "greater"
## $ method : chr "Fisher's Exact Test for Count Data"
## $ data.name : chr "matrix(c(target_in_important_n, all_taxa_in_important_n - target_in_important_n, target_in_all_n - target_in_im"| __truncated__
## - attr(*, "class")= chr "htest"
map(fisher_output_at, 1) # here we extarct the first list element of each taxa in the list (p values)
## $p__Acidobacteriota
## [1] 0.8473672
##
## $p__Actinobacteriota
## [1] 0.9985258
##
## $p__Bacteroidota
## [1] 0.09193391
##
## $p__Gemmatimonadota
## [1] 0.8433388
##
## $p__Myxococcota
## [1] 0.09167539
##
## $p__Proteobacteria
## [1] 0.1964258
##
## $c__Acidobacteriae
## [1] 0.4959192
##
## $c__Alphaproteobacteria
## [1] 0.3283449
##
## $c__Bacteroidia
## [1] 0.09193391
##
## $c__Gammaproteobacteria
## [1] 0.1752416
##
## $c__Gemmatimonadetes
## [1] 0.8197096
##
## $c__Polyangia
## [1] 0.0578331
##
## $o__Burkholderiales
## [1] 0.5466162
##
## $o__Caulobacterales
## [1] 0.3049718
##
## $o__Chitinophagales
## [1] 0.2735582
##
## $o__Cytophagales
## [1] 0.2593899
##
## $o__Flavobacteriales
## [1] 0.0578331
##
## $o__Gammaproteobacteria_Incertae_Sedis
## [1] 0.0584339
##
## $o__Gemmatimonadales
## [1] 0.8197096
##
## $o__Rhizobiales
## [1] 0.004991002
##
## $o__Sphingomonadales
## [1] 0.9912935
##
## $o__Xanthomonadales
## [1] 0.09126703
##
## $f__Caulobacteraceae
## [1] 0.2803993
##
## $f__Chitinophagaceae
## [1] 0.2554614
##
## $f__Comamonadaceae
## [1] 0.07230109
##
## $f__Devosiaceae
## [1] 0.00597422
##
## $f__Gemmatimonadaceae
## [1] 0.8197096
##
## $f__Oxalobacteraceae
## [1] 0.3631943
##
## $f__Rhizobiaceae
## [1] 0.05558943
##
## $f__Rhodanobacteraceae
## [1] 0.3626101
##
## $f__Sphingomonadaceae
## [1] 0.9912935
##
## $f__Unknown_Family
## [1] 0.0584339
##
## $f__Xanthobacteraceae
## [1] 0.1200347
##
## $f__Xanthomonadaceae
## [1] 0.1163108
##
## $g__Acidibacter
## [1] 0.0584339
##
## $`g__Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium`
## [1] 0.1323728
##
## $g__Gemmatimonas
## [1] 0.3740159
##
## $g__Lysobacter
## [1] 0.1097332
##
## $g__Massilia
## [1] 0.3344271
##
## $g__Niastella
## [1] 0.01960319
map(fisher_output_at, 3) # here we extarct the third list element of each taxa in the list (odds ratio)
## $p__Acidobacteriota
## odds ratio
## 0.7106417
##
## $p__Actinobacteriota
## odds ratio
## 0.2590167
##
## $p__Bacteroidota
## odds ratio
## 1.557417
##
## $p__Gemmatimonadota
## odds ratio
## 0.684851
##
## $p__Myxococcota
## odds ratio
## 2.752893
##
## $p__Proteobacteria
## odds ratio
## 1.223502
##
## $c__Acidobacteriae
## odds ratio
## 1.176164
##
## $c__Alphaproteobacteria
## odds ratio
## 1.138922
##
## $c__Bacteroidia
## odds ratio
## 1.557417
##
## $c__Gammaproteobacteria
## odds ratio
## 1.233855
##
## $c__Gemmatimonadetes
## odds ratio
## 0.7134832
##
## $c__Polyangia
## odds ratio
## 3.37105
##
## $o__Burkholderiales
## odds ratio
## 0.9959746
##
## $o__Caulobacterales
## odds ratio
## 1.442557
##
## $o__Chitinophagales
## odds ratio
## 1.366568
##
## $o__Cytophagales
## odds ratio
## 1.876752
##
## $o__Flavobacteriales
## odds ratio
## 3.37105
##
## $o__Gammaproteobacteria_Incertae_Sedis
## odds ratio
## 4.534428
##
## $o__Gemmatimonadales
## odds ratio
## 0.7134832
##
## $o__Rhizobiales
## odds ratio
## 2.283077
##
## $o__Sphingomonadales
## odds ratio
## 0.3304971
##
## $o__Xanthomonadales
## odds ratio
## 1.718467
##
## $f__Caulobacteraceae
## odds ratio
## 1.50201
##
## $f__Chitinophagaceae
## odds ratio
## 1.400058
##
## $f__Comamonadaceae
## odds ratio
## 1.602046
##
## $f__Devosiaceae
## odds ratio
## 22.69165
##
## $f__Gemmatimonadaceae
## odds ratio
## 0.7134832
##
## $f__Oxalobacteraceae
## odds ratio
## 1.158663
##
## $f__Rhizobiaceae
## odds ratio
## 2.542972
##
## $f__Rhodanobacteraceae
## odds ratio
## 1.496563
##
## $f__Sphingomonadaceae
## odds ratio
## 0.3304971
##
## $f__Unknown_Family
## odds ratio
## 4.534428
##
## $f__Xanthobacteraceae
## odds ratio
## 1.978932
##
## $f__Xanthomonadaceae
## odds ratio
## 1.790011
##
## $g__Acidibacter
## odds ratio
## 4.534428
##
## $`g__Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium`
## odds ratio
## 2.826685
##
## $g__Gemmatimonas
## odds ratio
## 1.35984
##
## $g__Lysobacter
## odds ratio
## 2.229048
##
## $g__Massilia
## odds ratio
## 1.247307
##
## $g__Niastella
## odds ratio
## 2.805325
this functions will now take p values and oods ratio out of the fisher test output and into the heat tree
# extract p values and odds ration from the fisher tests and put it into the heat tree
fisher_to_heatTree(fisher_output = fisher_output_at,
ps_important_taxa = imp_ASV_ps_l$At)
What are we seeing here? each tree shows the full taxonomy of all ASVs that were “important” because they are: 1) differentially abundant in stress treatment pairwise comparison, OR; 2) predictors of stress treatments in a random forest, OR; 3) defined as keystone taxa, module hubs or module connectors in a network
The colored taxa shows adjusted p-values from the fisher test (p >0.15 is turned to grey)
By comparing the full taxonomies of these important ASVs with the the full taxonomies of all ASVs that were NOT “important”, we see that o__Rhizobiales and f__Devosiaceae are much too frequent in the “important” ASV dataset. This is likely because they consistently have major roles in stress response and community structure, wich was determined by these 3 analysis methods. note that for this summary we don’t consider ASV abundance (number of reads/sequences of each ASV) as it was already accounted for in the other 3 methods.
now, run these custom functions over lists of phyloseq objects separated by species with mapply
# run the custom function voer 2 lists of philoseq objects, one with imporntat taxa and other with the full taxa (for every partition)
fisher_result_l<-mapply(function (list_1,list_2)
fisher_all_taxa_groups(ps_important_taxa = list_1,
ps_all_taxa = list_2),
list_1 = imp_ASV_ps_l,
list_2 = ps_l,
SIMPLIFY = FALSE)
#run the custom function on a list of 3 species
fisher_summaries_3l<-mapply(function(list_1,list_2)
fisher_to_heatTree(fisher_output = list_1,
ps_important_taxa = list_2),
list_1 = fisher_result_l,
list_2 = imp_ASV_ps_l,
SIMPLIFY = FALSE)
To simply the evaluation, let’s look into the output species by species
# summary in A. thaliana
fisher_summaries_3l$At
# summary in B oleraceae
fisher_summaries_3l$Bo_M
# summary in I. tinctoria
fisher_summaries_3l$It
What is the output we have here? let’s word it in different ways
In A. thaliana, o__Rhizobiales and f__Devosiaceae are found too often in the importnat ASV subset
In B oleraceae, the proportion of o__Rhizobiales and f__Beijerinckiaceae is significantly higher in the important ASV dataset when considering the full ASV dataset
Do you see that branch from o__Streptomycetales on in I. tinctoria? considering the full dataset, there are too many representatives of that branch in the imporant ASV subset
These summary trees were not particularly highlighted as significant because the base operations to calculate differential abundance, random forest, and networks, was based on a much larger dataset with 70k ASVs, and I did not have the time to re-run it for this 1k ASV dataset.
I hope you found the workshop insightful and that you will find the code useful as a template you can improve. feel free to share this with collegues. as you know, you will need many microbiome tutorials for a complete, updated analysis.
Are you curious about all the data wrangling of these 2 custom functions? here you can execute a hard-coded version of the function, where you can visualize each object step by step.